Library Imports

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

from datetime import date

Template

spark = (
    SparkSession.builder
    .master("local")
    .appName("Section 2.4 - Constant Values")
    .config("spark.some.config.option", "some-value")
    .getOrCreate()
)

sc = spark.sparkContext

import os

data_path = "/data/pets.csv"
base_path = os.path.dirname(os.getcwd())
path = base_path + data_path
pets = spark.read.csv(path, header=True)
pets.toPandas()
id breed_id nickname birthday age color
0 1 1 King 2014-11-22 12:30:31 5 brown
1 2 3 Argus 2016-11-22 10:05:10 10 None
2 3 1 Chewie 2016-11-22 10:05:10 15 None

Constant Values

There are many instances where you will need to create a column expression or use a constant value to perform some of the spark transformations. We'll explore some of these.

Case 1: Creating a Column with a constant value (withColumn()) (wrong)

pets.withColumn('todays_date', date.today()).toPandas()
---------------------------------------------------------------------------

AssertionError                            Traceback (most recent call last)

<ipython-input-4-f87e239cb534> in <module>()
----> 1 pets.withColumn('todays_date', date.today()).toPandas()


/usr/local/lib/python2.7/site-packages/pyspark/sql/dataframe.pyc in withColumn(self, colName, col)
   1846 
   1847         """
-> 1848         assert isinstance(col, Column), "col should be Column"
   1849         return DataFrame(self._jdf.withColumn(colName, col._jc), self.sql_ctx)
   1850 


AssertionError: col should be Column

What Happened?

Spark functions that have a col as an argument will usually require you to pass in a Column expression. As seen in the previous section, withColumn() worked fine when we gave it a column from the current df. But this isn't the case when we want set a column to a constant value.

If you get an AssertionError: col should be Column that is usually the case, we'll look into how to fix this.

Case 1: Creating a Column with a constant value (withColumn()) (correct)

pets.withColumn('todays_date', F.lit(date.today())).toPandas()
id breed_id nickname birthday age color todays_date
0 1 1 King 2014-11-22 12:30:31 5 brown 2019-02-14
1 2 3 Argus 2016-11-22 10:05:10 10 None 2019-02-14
2 3 1 Chewie 2016-11-22 10:05:10 15 None 2019-02-14

What Happened?

With F.lit() you can create a column expression that you can now assign to a new column in your dataframe.

More Examples

(
    pets
    .withColumn('age_greater_than_5', F.col("age") > 5)
    .withColumn('height', F.lit(150))
    .where(F.col('breed_id') == 1)
    .where(F.col('breed_id') == F.lit(1))
    .toPandas()
)
id breed_id nickname birthday age color age_greater_than_5 height
0 1 1 King 2014-11-22 12:30:31 5 brown False 150
1 3 1 Chewie 2016-11-22 10:05:10 15 None True 150

What Happened?

(We will look into equilities statements later.)

The above contains constant values (column height) and column expressions (columns using F.col()) so a F.lit() is not required.

Summary

  • You need to use F.lit() to assign constant values to columns.
  • Equality expressions with F.col() is also another way to have a column expressions.
  • When in doubt, always use column expressions F.lit().

results matching ""

    No results matching ""